author: Timothy H. Keitt date: May 12, 2014 width: 1440 height: 900
type: section
Tim Keitt <tkeitt@utexas.edu>
Daniel Mitchell <daniel.lewis.mitchell@gmail.com>
I study the spatial and temporal organization of populations, communities and ecosystems.
More at http://www.keittlab.org/
DBI standard now widely usedrgdal number 62 of top 100 downloaded packages on CRANsp classessp integrated into rgdalrgdal2 package using Rcpptype: section
type: section
library(sp)
data(meuse)
coordinates(meuse) <- ~x+y # coerce to SpatialPointsDataFrame
library(lattice) # spplot is based on the lattice package
trellis.par.set(sp.theme()) # load some custom colors into lattice graphics
spplot(meuse, "dist")
library(sp)
library(maptools) # used here to read shapefiles
orig.wd = getwd()
setwd("example-data/NCEAS sample")
states <- readShapePoly("western-states")
reservoirs <- readShapePoly("western-reservoirs")
rivers <- readShapeLines("western-rivers")
dams <- readShapePoints("western-dams")
setwd(orig.wd)
plot.it = function()
{
plot(states, border = "wheat3", col = "wheat1")
lines(rivers, col = "dodgerblue3")
plot(reservoirs, col = "darkgreen", border = "darkgreen", add = TRUE)
points(dams, cex = 1.4, col = "darkred")
text(dams, labels = as.character(dams$DAM_NAME), col = "darkred", cex = 0.6, font = 2, offset = 0.5, adj = c(0,2))
title("Major Dams of the Western United States")
legend("bottomleft", legend = c("Major rivers", "Reservoirs", "Dams"), title = "Legend", bty = "n", inset = 0.05, lty = c( 1,-1,-1), pch = c(-1,15, 1), col = c("dodgerblue3", "darkgreen", "darkred"))
}
type: section
type: sub-section - Types of geospatial data - Vectors and simple features - Raster data - Topologies - Networks - Accuracy and precision - Map projections - Spatial indices
Vector data - Points - Lines - Polygons
Raster data - Spatial grids - Lookup tables
Topology - Areal data - Vertices - Faces
Network - Relational - Vertices - Edges - Planar network - Topology is a special case
type: sub-section - Types of geospatial data - Vectors and simple features - Raster data - Accuracy and precision - Map projections - Spatial indices
OGC Simple Feature Hierarchy
OGC Simple Feature Hierarchy
OGC Simple Feature Predicates
OGC Simple Feature Set Operations
Handled by GEOS library; bindings in rgeos package
type: sub-section - Types of geospatial data - Vectors and simple features - Raster data - Accuracy and precision - Map projections - Spatial indices
Raster data
Location is implicit using row and column offsets
Raster data
Raster data
For geospatially registered data, two coordinate systems - Row and column offsets - Geospatial coordinates
For many spatial reference systems, conversion from map coordinates to raster offsets can be achieved via an affine transform.
\[ \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} + \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} c \\ r \end{bmatrix} \]
\[ \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}^{-1} \left( \begin{bmatrix} x \\ y \end{bmatrix} - \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \right) = \begin{bmatrix} c \\ r \end{bmatrix} \]
If \(a_{11}\) or \(a_{22}\) are negative, then axis is “flipped”.
Affine transforms
type: sub-section - Types of geospatial data - Vectors and simple features - Raster data - Accuracy and precision - Map projections - Spatial indices
Accuracy and precision
Accuracy and precision
Resampling
Resampling methods
Accuracy and precision
Accuracy and precision
Complex versus simple features
Sources of imprecision - Measurement, recording and transcription errors - Have to clean the data - “Snapping” to reduced precision coordinates sometimes works - Numerical imprecision - Floating point round-off and other effects - Snapping may help - Infinite precision arithmetic possible
Complex versus simple features
Most spatial operators in R require simple features
Complex versus simple features
Most spatial operators in R require simple features
type: sub-section - Types of geospatial data - Vectors and simple features - Raster data - Accuracy and precision - Map projections - Spatial indices
Map projections
Map projections
Latitude and longitude are natural coordinates
Map projections
Nonetheless they are still relative to a model of earth
Map projections
Map projections
Map projections
Map projections
Example systems
Example coordinate systems
Example coordinate systems
type: sub-section - Types of geospatial data - Vectors and simple features - Raster data - Accuracy and precision - Map projections - Spatial indices
Map queries
Map queries
Map queries
type: sub-section - Types of geospatial data - Vectors and simple features - Raster data - Accuracy and precision - Map projections - Spatial indices
type: section
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
R interprets expressions
x = 3; print(x)
[1] 3
y = 2 * x + 4; print(y)
[1] 10
Getting help
help(ls)
?ls
??predict
help(package = stats)
Try googling “<topic> in R”
Session environment
library(nlme) # attach package
getwd() # where am I?
setwd("my.dir") # go there
ls() # list R objects
dir() # list files
q() # all done
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
Basic IO
x = read.csv('the-data.csv') # 90% of what I do
x = read.table('other-data.asc', header = TRUE, as.is = TRUE)
write.csv(object, file = 'output.csv')
These return data frames. More on that in a bit.
Database access
library(RPostgreSQL)
drv = dbDriver("PostgreSQL")
con = dbConnect(drv, dbname = "testing")
res = dbSendQuery(con, "select length from coastlines")
dframe = fetch(res)
Enables powerful SQL queries to RDMS. See also the sqldf package.
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
Assignment
a = 1
b <- 2
3 -> c
print(a); print(b); print(c)
[1] 1
[1] 2
[1] 3
I will use = and not <- or ->. R does not care.
Control flow (compact)
if ( TRUE ) print("yes") else print("no")
[1] "yes"
z = rep(1, 10); print(z)
[1] 1 1 1 1 1 1 1 1 1 1
for ( i in 3:10 ) z[i] = z[i-1] + z[i - 2]; print(z)
[1] 1 1 2 3 5 8 13 21 34 55
if and for are sufficient for the vast majority of programs
Control flow (better layout)
i = 7
while ( i )
{
z[i] = z[i] / 2
i = i - 1
if ( i < 3 ) break
}
print(z) # divide elements 4:7 by 2
[1] 1.0 1.0 1.0 1.5 2.5 4.0 6.5 21.0 34.0 55.0
while is less common but useful in cases with an indeterminate number of loops
Control flow (better layout)
i = 7
while ( i )
{
z[i] = z[i] / 2
i = i - 1
if ( i < 3 ) break
}
f(x), x[1]type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
R expressions are vectorized
x = 1:5; print(x)
[1] 1 2 3 4 5
y = 2 * x; print(y)
[1] 2 4 6 8 10
The rule is that the expression is evaluated for each element
z = 1:5; print(2 * z)
[1] 2 4 6 8 10
for ( i in 1:5 )
{
z[i] = 2 * z[i]
}
print(z)
[1] 2 4 6 8 10
Functions may or may not return a value for each element of an input vector
print(sqrt(1:5)) # a 'map'
[1] 1.000 1.414 1.732 2.000 2.236
print(sum(1:5)) # a 'reduce'
[1] 15
print(summary(1:5)) # a more complex 'reduce'
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 2 3 3 4 5
Can be tricky in complex code
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
2-D indices are nested
a = matrix(1:9, 3)
print(a)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
print(a[1:2, 3:2])
[,1] [,2]
[1,] 7 4
[2,] 8 5
2-D indices are nested
b = matrix(NA, 2, 2)
ri = 1:2; ci = 3:2
for ( i in seq(along = ri) ) # for each index in ri
for ( j in seq(along = ci) ) # loop over indices in ci
b[i, j] = a[ri[i], ci[j]] # ith ri and jth ci
print(b)
[,1] [,2]
[1,] 7 4
[2,] 8 5
print(a[ri, ci]) # equivalent expression
[,1] [,2]
[1,] 7 4
[2,] 8 5
You can omit braces for a single loop expression (not preferred)
Rule is if no comma, then use each element of index
print(a)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
print(a[1:5]) # 1-D index gives 1-D result
[1] 1 2 3 4 5
Matrices are stored column-wise
Index can be higher dimensional
i = which(diag(3) == 1, arr.ind = TRUE); print(i)
row col
[1,] 1 1
[2,] 2 2
[3,] 3 3
print(diag(a[i])) # a[i[1,]], a[i[2,]]...
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 5 0
[3,] 0 0 9
print(a[i[, 1], i[, 2]]) # a[i[1, 1], i[1, 2]]...
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Note the difference
print(a[i])
[1] 1 5 9
print(a[i[, 1], i[, 2]])
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Multiple indices separated by commas are nested
Using indices creatively is often the fastest way to extract and rearrange data in R
i = rep(1:3, each = 2); print(i)
[1] 1 1 2 2 3 3
print(a[i, i])
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 4 4 7 7
[2,] 1 1 4 4 7 7
[3,] 2 2 5 5 8 8
[4,] 2 2 5 5 8 8
[5,] 3 3 6 6 9 9
[6,] 3 3 6 6 9 9
Using indices creatively is often the fastest way to extract and rearrange data in R
i = 1:3; print(sample(i))
[1] 2 1 3
print(a[sample(i), sample(i)]) # row-column permutation
[,1] [,2] [,3]
[1,] 2 5 8
[2,] 3 6 9
[3,] 1 4 7
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
Functions
f = function(a, b = 2, c = NULL)
{
res = a * b
if ( ! is.null(c) ) res = res * c
return(res)
}
print(f(1, 2, 3)); print(f(4))
[1] 6
[1] 8
Using NULL as a flag facilitates reuse
Functions are objects
print(class(f))
[1] "function"
print(formals(f))
$a
$b
[1] 2
$c
NULL
Functions are objects
print(body(f))
{
res = a * b
if (!is.null(c))
res = res * c
return(res)
}
body(f) = "gotcha"
print(f(6))
[1] "gotcha"
Scope
c = "mice" # global
f = function(a = "three") # function formals
{
b = "blind" # function body
return(paste(a, b, c)) # three scopes
}
print(f())
[1] "three blind mice"
Object not found in current scope initiates search upward into enclosing scopes
Scope
x = 2 # global scope
f = function() x = 2 * x # 2 different x's here
print(x); print(f()); print(x)
[1] 2
[1] 4
[1] 2
The assignment in the function body creates a variable x whose scope is the function body
Useful for closures
f = function()
{
x = sum(rnorm(100)) # 1-time stuff here
function(y) x * y # return a function
}
g = f() # closure factory
print(c(g(2), f()(2))) # uses x created by f()
[1] -30.34 21.47
The factory function is just a convenient way to bind an anonymous environment to the returned closure. I use this all the time to speed up calculations.
Function objects and closures are the key to functional programming
Not functional
x = matrix(rnorm(25), 5)
row.sums = rep(NA, 5)
for ( i in 1:5 ) row.sums[i] = sum(x[i, ])
print(row.sums)
[1] 0.6581 -3.5508 3.7498 1.2681 -0.3079
Function objects and closures are the key to functional programming
Functional
f = function(i) sum(x[i, ]) # x global scope
row.sums = sapply(1:5, f) # f(1), f(2)...
print(row.sums)
[1] 0.6581 -3.5508 3.7498 1.2681 -0.3079
lapply family of functions use C-level looping = fasttype: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
Arrays, including vectors and matrices, hold one type; lists hold different types
# coerced to character
print(c(1, 'a', TRUE))
[1] "1" "a" "TRUE"
# retain individual types
print(list(1, 'a', TRUE))
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
Single v. double braces
x = as.list(1:3)
print(x[2]); print(class(x[2])) # returns list
[[1]]
[1] 2
[1] "list"
print(x[[2]]); print(class(x[[2]])) # returns list element
[1] 2
[1] "integer"
Data frames are lists of vectors
a = 1:3
b = factor(1:3)
c = letters[1:3]
x = data.frame(a = a, b = b, c = c)
print(x)
a b c
1 1 1 a
2 2 2 b
3 3 3 c
print(names(x))
[1] "a" "b" "c"
Use list operators to extract columns
print(x$a)
[1] 1 2 3
print(x[[2]]) # a vector of factors
[1] 1 2 3
Levels: 1 2 3
print(x[['c']]) # different factors
[1] a b c
Levels: a b c
Or matrix or vector indexing
print(x[1:2, 2:3])
b c
1 1 a
2 2 b
print(x[2])
b
1 1
2 2
3 3
Subsetting
y = subset(x, b %in% 2:3, select = c(b, c))
print(y)
b c
2 2 b
3 3 c
Lots of new fancy packages for manipulating data frames. See reshape, plyr, dplyr, sqldf and others.
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
Raison d’être of R is modeling
x = rnorm(100)
y = 1 + 2 * x + rnorm(100)
plot(y ~ x)
Raison d’être of R is modeling
mod1 = lm(y ~ x) # y = b0 + b1 * x
summary(mod1)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.5983 -0.7540 -0.0803 0.6819 2.4079
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.054 0.103 10.2 <2e-16 ***
x 2.075 0.106 19.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.03 on 98 degrees of freedom
Multiple R-squared: 0.795, Adjusted R-squared: 0.793
F-statistic: 381 on 1 and 98 DF, p-value: <2e-16
S3 classes and methods
print(class(mod1)) # S3 class
[1] "lm"
methods(class = class(mod1)) # S3 methods
[1] add1.lm* alias.lm* anova.lm*
[4] case.names.lm* confint.lm cooks.distance.lm*
[7] deviance.lm* dfbeta.lm* dfbetas.lm*
[10] drop1.lm* dummy.coef.lm effects.lm*
[13] extractAIC.lm* family.lm* formula.lm*
[16] hatvalues.lm* influence.lm* kappa.lm
[19] labels.lm* logLik.lm* model.frame.lm*
[22] model.matrix.lm nobs.lm* plot.lm*
[25] predict.lm print.lm* proj.lm*
[28] qqnorm.lm* qr.lm* residuals.lm
[31] rstandard.lm* rstudent.lm* simulate.lm*
[34] summary.lm variable.names.lm* vcov.lm*
Non-visible functions are asterisked
Calling a generic method
plot(mod1)
Raison d’être of R is modeling
anova(mod1)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 404 404 381 <2e-16 ***
Residuals 98 104 1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model with x is much better
Model syntax
Y ~ X # Y = B0 + B1 * X
Y ~ 0 + X # Y = B1 * X
Y ~ X1 + X2 # Y = B0 + B1 * X1 + B2 * X2
Y ~ X1 * X2 # Y = B0 + B1 * X1 + B2 * X2 + B3 * X1 * X2
Y ~ I(X / 2) # Y = B0 + B1 * (X / 2)
I evaluates argument as a regular R expressionPeaking under the hood
str(mod1[1:6])
List of 6
$ coefficients : Named num [1:2] 1.05 2.08
..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
$ residuals : Named num [1:100] -0.243 -0.283 -0.848 0.539 1.863 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ effects : Named num [1:100] -11.892 -20.09 -0.804 0.554 1.912 ...
..- attr(*, "names")= chr [1:100] "(Intercept)" "x" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:100] 0.767 2.011 2.875 0.578 3.303 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
- S3 objects are usually lists with a class attribute - str can be helpful with “what the heck is that?”
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
Mixed effects with S4 classes \[Y = X \beta + Zu + \epsilon\]
library(lme4)
x = rnorm(100)
z = rbinom(100, 1, 0.5)
y = 1 + 2 * x - 3 * z + rnorm(100)
Mixed effects with S4 classes \[Y = X \beta + Zu + \epsilon\]
library(lattice)
xyplot(y ~ x | z, cex = 3) # cex is graphical parameter
Mixed effects with S4 classes \[Y = X \beta + Zu + \epsilon\]
mod2 = lmer(y ~ x + (1 | z)) # z is random intercept
print(mod2)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
REML criterion at convergence: 281.9
Random effects:
Groups Name Std.Dev.
z (Intercept) 2.007
Residual 0.946
Number of obs: 100, groups: z, 2
Fixed Effects:
(Intercept) x
-0.436 2.214
What is mod2?
class(mod2)
[1] "lmerMod"
attr(,"package")
[1] "lme4"
isS4(mod2)
[1] TRUE
A generic method applied to S4 object
ranef(mod2)
$z
(Intercept)
0 1.416
1 -1.416
y was constructed with -3 * zS4 object have slots
slotNames(mod2)
[1] "resp" "Gp" "call" "frame" "flist" "cnms" "lower"
[8] "theta" "beta" "u" "devcomp" "pp" "optinfo"
show(mod2@beta) # show is S4 print
[1] -0.4365 2.2141
Fixed effects stored in beta
Access S4 slots with @ or slot method
head(mod2@frame, n = 3) # the data
y x z
1 2.808 0.9571 0
2 -3.441 -1.4161 0
3 -4.325 -2.7417 0
head(slot(mod2, 'frame'), n = 3)
y x z
1 2.808 0.9571 0
2 -3.441 -1.4161 0
3 -4.325 -2.7417 0
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
Iterators increment or decrement on each call rather than existing as a vector of values.
library(iterators)
i = iter(1:3)
c(nextElem(i), nextElem(i), nextElem(i))
[1] 1 2 3
Iterators can be convenient but really shine when they allow a computation to proceed incrementally without storing the entire iterator sequence in memory.
foreach evaluates an expression for each value of an iterator sequence
library(foreach)
foreach(i = 1:3) %do% rnorm(1)
[[1]]
[1] -0.3558
[[2]]
[1] 0.7485
[[3]]
[1] -2.969
The real action is in the %do% infix operator.
The .combine argument gives more control
library(foreach)
foreach(i = 1:3, .combine = c) %do% rnorm(1)
[1] 1.2698 0.1409 -1.5221
This is a form of map (the expression) and reduce (the .combine function)
Something a bit more challenging
f = function(n = 1e7) prod(rnorm(n))
system.time(f())
user system elapsed
1.097 0.026 1.123
About 1.5 seconds on my laptop
Something a bit more challenging
system.time(
foreach(i = 1:5, .combine = prod) %do% f()
)
user system elapsed
5.028 0.056 5.084
About 6.5 seconds on my laptop
Now the really cool part
library(doMC) # multicore extensions
registerDoMC() # create parallel engine
system.time(
foreach(i = 1:5, .combine = prod) %dopar% f()
)
user system elapsed
0.014 0.031 1.427
About 2.5 seconds on my laptop. Here %dopar% automatically splits the computation across all of the available cores.
Note: this may crash RStudio under some circumstances. Better to run %dopar% on the command line.
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops
R does linear algebra
x = 1:3 # [1, 2, 3]
x %*% x # inner product
[,1]
[1,] 14
x %*% t(x) # outer product
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 6
[3,] 3 6 9
R does linear algebra
a = matrix(1:9, 3); print(a)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
a %*% x # matrix - vector product
[,1]
[1,] 30
[2,] 36
[3,] 42
R does linear algebra
a %*% a # matrix - matrix product
[,1] [,2] [,3]
[1,] 30 66 102
[2,] 36 81 126
[3,] 42 96 150
eigen(a) # eigenvalue decomposition
$values
[1] 1.612e+01 -1.117e+00 -5.701e-16
$vectors
[,1] [,2] [,3]
[1,] -0.4645 -0.8829 0.4082
[2,] -0.5708 -0.2395 -0.8165
[3,] -0.6770 0.4039 0.4082
R does linear algebra
type: sub-section - Getting started - Getting data in and out - Basic syntax and control flow - Vectorized expressions - Array indices - Functions and functional programming - Lists and data frames - Model syntax and S3 methods - Model syntax and S4 methods - Iterators and foreach - Matrix-vector ops